Microbiome Bioinformatics Snakemake workflow
…BRIEF INTRO IN PROGRESS…
A tentative Snakemake workflow that defines QIIME 2 bioinformatics rules as a DAG (directed acyclic graph). A detailed interactive Snakemake HTML report is available here. It lets you explore the workflow and the associated statistics. Closing the left sidebar (overlay) gives a wider display.
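For orientation, a rule graph and an HTML report like the ones mentioned above can usually be produced from Snakemake's command line. The sketch below is a minimal example, not the project's confirmed procedure: it assumes the commands run from the project root, that the Snakefile lives at workflow/Snakefile (as in the project tree), and that Graphviz (`dot`) is installed.

```shell
# Assumed location of the Snakefile, taken from the project tree in the Appendix.
snakefile="workflow/Snakefile"

if command -v snakemake >/dev/null 2>&1 \
   && command -v dot >/dev/null 2>&1 \
   && [ -f "$snakefile" ]; then
  mkdir -p dags
  # Render the rule-level DAG as an SVG image.
  snakemake --snakefile "$snakefile" --rulegraph | dot -Tsvg > dags/rulegraph.svg
  # Preview the jobs that would run, without executing them.
  snakemake --snakefile "$snakefile" --cores 1 --dry-run
  # After a completed run, build the interactive HTML report.
  snakemake --snakefile "$snakefile" --report report.html
else
  echo "snakemake, dot, or $snakefile not found; skipping"
fi
```

The guard keeps the snippet harmless on machines where the tools or the Snakefile are missing.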
Getting started with the QIIME 2 pipeline
Get the QIIME 2 YAML file
wget https://data.qiime2.org/distro/core/qiime2-2023.2-py38-osx-conda.yml
Create the QIIME 2 environment and install QIIME 2
Current YAML file: qiime2-2023.2-py38-osx-conda.yml (macOS build).
The QIIME 2 YAML file contains over 500 dependencies. Listed below are just a few of the QIIME 2 framework dependencies, to show what the installation pulls in.
name: qiime2202320
channels:
- qiime2/label/r2023.2
- conda-forge
- bioconda
- defaults
dependencies:
- q2cli=2023.2.0
- qiime2=2023.2.0
- python=3.8.16
- q2-alignment=2023.2.0
- q2-composition=2023.2.0
- q2-cutadapt=2023.2.0
- q2-dada2=2023.2.0
- q2-deblur=2023.2.0
- q2-demux=2023.2.0
- q2-diversity=2023.2.0
- q2-diversity-lib=2023.2.0
- q2-emperor=2023.2.0
- q2-feature-classifier=2023.2.0
- q2-feature-table=2023.2.0
- q2-fragment-insertion=2023.2.0
- q2-gneiss=2023.2.0
- q2-longitudinal=2023.2.0
- q2-metadata=2023.2.0
- q2-mystery-stew=2023.2.0
- q2-phylogeny=2023.2.0
- q2-quality-control=2023.2.0
- q2-quality-filter=2023.2.0
- q2-sample-classifier=2023.2.0
- q2-taxa=2023.2.0
- q2-types=2023.2.0
- q2-vsearch=2023.2.0
Installing QIIME 2 using a bash script
conda activate base
wget https://data.qiime2.org/distro/core/qiime2-2023.2-py38-osx-conda.yml
conda env create -n qiime2-2023.2 --file qiime2-2023.2-py38-osx-conda.yml
conda activate qiime2-2023.2
qiime info
Downloading demo data
The demo data come from one of the QIIME 2[1] tutorials (Atacama soils).
mkdir -p resources
mkdir -p resources/reads
mkdir -p resources/references
cd resources/reads
mkdir -p emp-paired-end-sequences
wget \
-O "sample-metadata.tsv" \
"https://data.qiime2.org/2023.2/tutorials/atacama-soils/sample_metadata.tsv"
wget \
-O "emp-paired-end-sequences/forward.fastq.gz" \
"https://data.qiime2.org/2023.2/tutorials/atacama-soils/10p/forward.fastq.gz"
wget \
-O "emp-paired-end-sequences/reverse.fastq.gz" \
"https://data.qiime2.org/2023.2/tutorials/atacama-soils/10p/reverse.fastq.gz"
wget \
-O "emp-paired-end-sequences/barcodes.fastq.gz" \
"https://data.qiime2.org/2023.2/tutorials/atacama-soils/10p/barcodes.fastq.gz"
Download a QIIME 2 trained classifier
wget \
-O "gg-13-8-99-515-806-nb-classifier.qza" \
"https://data.qiime2.org/2023.2/common/gg-13-8-99-515-806-nb-classifier.qza"
Other classifiers also exist. See the QIIME 2 website for more information.
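Once downloaded, a pre-trained classifier is typically applied to representative sequences with `qiime feature-classifier classify-sklearn`. The sketch below is illustrative only: the input name rep-seqs.qza and the output names are assumptions borrowed from the project tree, not confirmed workflow outputs.

```shell
classifier="gg-13-8-99-515-806-nb-classifier.qza"
reads="rep-seqs.qza"       # assumed: representative sequences from denoising
taxonomy="taxonomy.qza"    # assumed output name

if command -v qiime >/dev/null 2>&1 && [ -f "$classifier" ] && [ -f "$reads" ]; then
  # Assign taxonomy to each representative sequence.
  qiime feature-classifier classify-sklearn \
    --i-classifier "$classifier" \
    --i-reads "$reads" \
    --o-classification "$taxonomy"
  # Produce a browsable table of the assignments (taxonomy.qzv).
  qiime metadata tabulate \
    --m-input-file "$taxonomy" \
    --o-visualization "${taxonomy%.qza}.qzv"
else
  echo "qiime or input artifacts not found; skipping taxonomy assignment"
fi
```

Pre-trained classifiers are tied to the scikit-learn version of the QIIME 2 release they were built with, so the 2023.2 classifier should be used inside the 2023.2 environment.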
Overview of QIIME 2 classification methods
De novo clustering
Sequences are clustered against one another.
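As a minimal sketch, de novo clustering can be run with the q2-vsearch plugin. The input and output paths and the 99% threshold are assumptions inferred from file names in the project tree (for example feature-table-dn-99.qza); adjust them to your own layout.

```shell
table="qiime2_process/feature-table.qza"   # assumed input feature table
seqs="qiime2_process/rep-seqs.qza"         # assumed input representative sequences
pct=99                                     # identity threshold in percent (two digits assumed)
identity="0.${pct}"                        # value passed to --p-perc-identity

if command -v qiime >/dev/null 2>&1 && [ -f "$table" ] && [ -f "$seqs" ]; then
  qiime vsearch cluster-features-de-novo \
    --i-table "$table" \
    --i-sequences "$seqs" \
    --p-perc-identity "$identity" \
    --o-clustered-table "qiime2_process/feature-table-dn-${pct}.qza" \
    --o-clustered-sequences "qiime2_process/rep-seqs-dn-${pct}.qza"
else
  echo "qiime or input artifacts not found; skipping de novo clustering"
fi
```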
Closed-reference clustering
Here the clustering is performed at 99% identity against the Greengenes reference database.
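A hedged sketch of the closed-reference step with q2-vsearch follows. The reference artifact resources/85_otus.qza and the other paths are assumptions taken from the project tree. The text above states 99% identity, while the output names in the tree (such as table-cr-85.qza) suggest 85%, so the threshold is kept as a variable.

```shell
table="qiime2_process/feature-table.qza"   # assumed input feature table
seqs="qiime2_process/rep-seqs.qza"         # assumed input representative sequences
reference="resources/85_otus.qza"          # assumed Greengenes reference artifact
pct=99                                     # as stated in the text; the tree's *-85 files suggest 85
identity="0.${pct}"
out="qiime2_process"

if command -v qiime >/dev/null 2>&1 \
   && [ -f "$table" ] && [ -f "$seqs" ] && [ -f "$reference" ]; then
  # Features that do not match the reference are written to the unmatched artifact.
  qiime vsearch cluster-features-closed-reference \
    --i-table "$table" \
    --i-sequences "$seqs" \
    --i-reference-sequences "$reference" \
    --p-perc-identity "$identity" \
    --o-clustered-table "$out/table-cr-${pct}.qza" \
    --o-clustered-sequences "$out/rep-seqs-cr-${pct}.qza" \
    --o-unmatched-sequences "$out/unmatched-cr-${pct}.qza"
else
  echo "qiime or input artifacts not found; skipping closed-reference clustering"
fi
```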
Open-reference clustering
Here sequences are first clustered at 99% identity against the Greengenes reference database; sequences that do not match the reference are then clustered de novo.
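The open-reference variant can be sketched the same way. Paths and the reference artifact are again assumptions based on the project tree, and the threshold is a variable because the tree's output names (such as table-or-85.qza) suggest 85% rather than the 99% stated above.

```shell
table="qiime2_process/feature-table.qza"   # assumed input feature table
seqs="qiime2_process/rep-seqs.qza"         # assumed input representative sequences
reference="resources/85_otus.qza"          # assumed Greengenes reference artifact
pct=99                                     # as stated in the text; the tree's *-85 files suggest 85
identity="0.${pct}"
out="qiime2_process"

if command -v qiime >/dev/null 2>&1 \
   && [ -f "$table" ] && [ -f "$seqs" ] && [ -f "$reference" ]; then
  # The new-reference artifact holds the reference plus the de novo cluster centroids.
  qiime vsearch cluster-features-open-reference \
    --i-table "$table" \
    --i-sequences "$seqs" \
    --i-reference-sequences "$reference" \
    --p-perc-identity "$identity" \
    --o-clustered-table "$out/table-or-${pct}.qza" \
    --o-clustered-sequences "$out/rep-seqs-or-${pct}.qza" \
    --o-new-reference-sequences "$out/new-ref-seqs-or-${pct}.qza"
else
  echo "qiime or input artifacts not found; skipping open-reference clustering"
fi
```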
Alignment of representative sequences
- The MAFFT (Multiple Alignment using Fast Fourier Transform) software provides alignments of the representative sequences.
- Then we run the alignment mask function to filter out unconserved and highly gapped alignment columns.
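The two alignment steps above map onto the q2-alignment plugin roughly as follows. This is a sketch under assumed file names (taken from the project tree), not the workflow's confirmed rule definitions.

```shell
seqs="qiime2_process/rep-seqs.qza"                      # assumed input
aligned="qiime2_process/aligned-rep-seqs.qza"           # assumed output of MAFFT
masked="qiime2_process/masked-aligned-rep-seqs.qza"     # assumed output of masking

if command -v qiime >/dev/null 2>&1 && [ -f "$seqs" ]; then
  # Step 1: multiple sequence alignment with MAFFT.
  qiime alignment mafft \
    --i-sequences "$seqs" \
    --o-alignment "$aligned"
  # Step 2: mask unconserved and highly gapped columns.
  qiime alignment mask \
    --i-alignment "$aligned" \
    --o-masked-alignment "$masked"
else
  echo "qiime or input artifacts not found; skipping alignment"
fi
```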
Identifying and filtering chimeric feature sequences
Citation
Please consider citing the iMAP article[2] if you find any part of the iMAP practical user guides helpful in your microbiome data analysis.
References
Appendix
Project main tree
.
├── LICENSE.md
├── README.md
├── config
│ ├── config.yml
│ ├── pbs
│ ├── samples.tsv
│ ├── slurm
│ └── units.tsv
├── dags
│ ├── rulegraph.png
│ └── rulegraph.svg
├── data
│ ├── README.md
│ ├── logs
│ ├── metadata
│ ├── mothur
│ ├── qiime
│ ├── reads
│ ├── references
│ └── test
├── images
│ ├── bioinformatics.png
│ ├── bkgd.png
│ ├── imap_part02.svg
│ ├── imap_part03.svg
│ ├── imap_part04.svg
│ ├── imap_part05.svg
│ ├── silvaalign.png
│ ├── smkreport
│ └── sra_config_cache.png
├── index.Rmd
├── library
│ ├── apa.csl
│ ├── export.bib
│ ├── imap.bib
│ └── references.bib
├── qiime2_process
│ ├── aligned-rep-seqs.qza
│ ├── demux.qza
│ ├── demux.qzv
│ ├── feature-table-dn-99.qza
│ ├── feature-table.qza
│ ├── feature-table.qzv
│ ├── masked-aligned-rep-seqs.qza
│ ├── masked-aligned-rep-seqs.qzv
│ ├── new-ref-seqs-or-85.qza
│ ├── rep-seqs-cr-85.qza
│ ├── rep-seqs-dn-99.qza
│ ├── rep-seqs-or-85.qza
│ ├── rep-seqs.qza
│ ├── rep-seqs.qzv
│ ├── rooted-tree.qza
│ ├── sample-metadata.qzv
│ ├── stats.qza
│ ├── stats.qzv
│ ├── table-cr-85.qza
│ ├── table-or-85.qza
│ ├── taxa-bar-plots.qzv
│ ├── taxonomy.qza
│ ├── taxonomy.qzv
│ ├── unmatched-cr-85.qza
│ └── unrooted-tree.qza
├── report.html
├── resources
│ ├── 85_otus.qza
│ ├── final_fasta
│ ├── gg-13-8-99-515-806-nb-classifier.qza
│ ├── metadata
│ ├── reads
│ └── test
├── results
│ └── project_tree.txt
├── smk.css
├── styles.css
├── tree.sh
└── workflow
  ├── Snakefile
  ├── envs
  ├── report
  ├── rules
  └── scripts
28 directories, 54 files
Troubleshooting and FAQs
- Question
- Question
-
Answer
-
Answer
